Compiling and Processing Historical and Contemporary Portuguese Corpora
Author
University of Cologne, Albertus-Magnus Platz, 50923 Cologne, Germany
Abstract
This technical report describes the framework used for processing three large Portuguese corpora. Two corpora contain newspaper texts, one published in Brazil and the other in Portugal. The third corpus is Colonia, a historical Portuguese collection containing texts written between the 16th and the early 20th century. The report presents the pre-processing, segmentation, and annotation of the corpora, as well as indexing and querying methods. Finally, it presents published research papers that use the corpora.
Similar resources
Building a Corpus-based Historical Portuguese Dictionary: Challenges and Opportunities
Historical corpora are important resources for different areas. Philology, Human Language Technology, Literary Studies, History, and Lexicography are some that benefit from them. However, compiling historical corpora is different from compiling contemporary corpora. Corpus designers have to deal with several characteristics inherent in historical texts, such as: absence of a spelling standard, ...
Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary
In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...
Providing Internet Access to Portuguese Corpora: the AC/DC Project
In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do português) in what concerns providing access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilização de Corpora, roughly "Access and Availability of Corpora") allows a user to query around 40 million words o...
Speech Recognition for Brazilian Portuguese using the Spoltech and OGI-22 Corpora
Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). This work describes efforts toward decreasing such gap and presents systems for speech recognition in BP using two public corpora: Spoltech and OGI-22. The following resources are made available: AT...
Commentary: Portuguese crypto-Jews: the genetic heritage of a complex history
Nogueiro et al. (2015) utilize Y chromosome and mitochondrial genotype data from contemporary Iberian and non-Iberian human populations to explore the genetic identity of Portuguese “crypto-Jews.” In the first section of the paper, a historical introduction reviews the plight of Jews in the Iberian Peninsula from the earliest archaeological evidence, through the Inquisition, to the current da...
Journal title:
- CoRR
Volume: abs/1710.00803
Publication date: 2017